Working with BigQuery tables and the Genomics API

Case Study: BRAF V600 mutations in CCLE cell-lines

In this notebook we'll show you how you might combine information available in BigQuery tables with sequence-reads that have been imported into Google Genomics. We'll be using the open-access CCLE data for this example.

You'll need to make sure that your project has the necessary APIs enabled, so take a look at the Getting started with Google Genomics page, and be sure to also have a look at this Getting started with the Genomics API tutorial notebook available on github.

We'll be using the Google Python API client so we'll need to install that first using the pip package manager.

NOTE that Datalab is currently using an older version of the oauth2client (1.4.12) and as a result we need to install an older version of the google-api-python-client that supports it.


In [24]:
!pip install --upgrade google-api-python-client==1.4.2


Requirement already up-to-date: google-api-python-client==1.4.2 in /usr/local/lib/python2.7/dist-packages
Cleaning up...

Next we're going to need to authenticate using the service account on the Datalab host.


In [25]:
from httplib2 import Http
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
http = Http()
credentials.authorize(http)


Out[25]:
<httplib2.Http at 0x7f34023a2710>

Now we can create a client for the Genomics API. NOTE that in order to use the Genomics API, you need to have enabled it for your GCP project.


In [26]:
from apiclient import discovery
ggSvc = discovery.build ( 'genomics', 'v1', http=http )

We're also going to want to work with BigQuery, so we'll need the biguery module. We will also be using the pandas and time modules.


In [27]:
import gcp.bigquery as bq
import pandas as pd
import time

The ISB-CGC group has assembled metadata as well as molecular data from the CCLE project into an open-access BigQuery dataset called isb-cgc:ccle_201602_alpha. In this notebook we will make use of two tables in this dataset: Mutation_calls and DataFile_info. You can explore the entire dataset using the BigQuery web UI.

Let's say that we're interested in cell-lines with BRAF V600 mutations, and in particular we want to see if there is evidence in both the DNA-seq and the RNA-seq data for these mutations. Let's start by making sure that there are some cell-lines with these mutations in our dataset:


In [28]:
%%sql

SELECT CCLE_name, Hugo_Symbol, Protein_Change, Genome_Change 
FROM [isb-cgc:ccle_201602_alpha.Mutation_calls] 
WHERE ( Hugo_Symbol="BRAF" AND Protein_Change CONTAINS "p.V600" )
ORDER BY Cell_line_primary_name
LIMIT 5


Out[28]:
CCLE_nameHugo_SymbolProtein_ChangeGenome_Change
8305C_THYROIDBRAFp.V600Eg.chr7:140453136A>T
8505C_THYROIDBRAFp.V600Eg.chr7:140453136A>T
A375_SKINBRAFp.V600Eg.chr7:140453136A>T
A673_BONEBRAFp.V600Eg.chr7:140453136A>T
A101D_SKINBRAFp.V600Eg.chr7:140453136A>T

(rows: 5, time: 0.5s, cached, job: job_bEFKS7D_j7mYWVeGJBmzD60iFZw)

OK, so let's get the complete list of cell-lines with this particular mutation:


In [29]:
%%sql --module get_mutated_samples

SELECT CCLE_name 
FROM [isb-cgc:ccle_201602_alpha.Mutation_calls] 
WHERE ( Hugo_Symbol="BRAF" AND Protein_Change CONTAINS "p.V600" )
ORDER BY Cell_line_primary_name

In [30]:
r = bq.Query(get_mutated_samples).results()
list1 = r.to_dataframe()
print " Found %d samples with a BRAF V600 mutation. " % len(list1)


 Found 54 samples with a BRAF V600 mutation. 

Now we want to know, from the DataFile_info table, which cell lines have both DNA-seq and RNA-seq data imported into Google Genomics. (To find these samples, we will look for samples that have non-null readgroupset IDs from "DNA" and "RNA" pipelines.)


In [31]:
%%sql --module get_samples_with_data

SELECT
  a.CCLE_name AS CCLE_name
FROM (
  SELECT
    CCLE_name
  FROM
    [isb-cgc:ccle_201602_alpha.DataFile_info]
  WHERE
    ( Pipeline CONTAINS "DNA"
      AND GG_readgroupset_id<>"NULL" ) ) a
JOIN (
  SELECT
    CCLE_name
  FROM
    [isb-cgc:ccle_201602_alpha.DataFile_info]
  WHERE
    ( Pipeline CONTAINS "RNA"
      AND GG_readgroupset_id<>"NULL" ) ) b
ON
  a.CCLE_name = b.CCLE_name

In [32]:
r = bq.Query(get_samples_with_data).results()
list2 = r.to_dataframe()
print " Found %d samples with both DNA-seq and RNA-seq reads. " % len(list2)


 Found 265 samples with both DNA-seq and RNA-seq reads. 

Now let's find out which samples are in both of these lists:


In [33]:
list3 = pd.merge ( list1, list2, how='inner', on=['CCLE_name'] )
print " Found %d mutated samples with DNA-seq and RNA-seq data. " % len(list3)


 Found 7 mutated samples with DNA-seq and RNA-seq data. 

No we're going to take a closer look at the reads from each of these samples. First, we'll need to be able to get the readgroupset IDs for each sample from the BigQuery table. To do this, we'll define a parameterized function:


In [34]:
%%sql --module get_readgroupsetid

SELECT Pipeline, GG_readgroupset_id 
FROM [isb-cgc:ccle_201602_alpha.DataFile_info]
WHERE CCLE_name=$c AND GG_readgroupset_id<>"NULL"

Let's take a look at how this will work:


In [35]:
aName = list3['CCLE_name'][0]
print aName
ids = bq.Query(get_readgroupsetid,c=aName).to_dataframe()
print ids


COLO783_SKIN
                Pipeline        GG_readgroupset_id
0  broad.mit.edu__DNASeq  CJKPhaq1GhC4rZSVj4TMoIIB
1  broad.mit.edu__RNASeq  CJKPhaq1GhDN4avdoaTXsKcB

Ok, so we see that for this sample, we have two readgroupset IDs, one based on DNA-seq and one based on RNA-seq. This is what we expect, based on how we chose this list of samples.

Now we'll define a function we can re-use that calls the GA4GH API reads.search method to find all reads that overlap the V600 mutation position. Note that we will query all of the readgroupsets that we get for each sample at the same time by passing in a list of readGroupSetIds. Once we have the reads, we'll organized them into a dictionary based on the local context centered on the mutation hotspot.


In [36]:
chr = "7"
pos = 140453135
width = 11
rgsList = ids['GG_readgroupset_id'].tolist()

def getReads ( rgsList, pos, width):
  
  payload = { "readGroupSetIds": rgsList,
            "referenceName": chr,
            "start": pos-(width/2),
            "end": pos+(width/2),
            "pageSize": 2048      
  }
  r = ggSvc.reads().search(body=payload).execute()
  
  context = {}
  for a in r['alignments']:
    rgsid = a['readGroupSetId']
    seq = a['alignedSequence']
    seqStartPos = int ( a['alignment']['position']['position'] )
    relPos = pos - (width/2) - seqStartPos
    if ( relPos >=0 and relPos+width<len(seq) ):
      # print rgsid, seq[relPos:relPos+width]
      c = seq[relPos:relPos+width]
      if (c not in context):
        context[c] = {}
        context[c][rgsid] = 1
      else:
        if (rgsid not in context[c]):
          context[c][rgsid] = 1
        else:
          context[c][rgsid] += 1

  for c in context:
    numReads = 0
    for a in context[c]:
      numReads += context[c][a]
    # write it out only if we have information from two or more readgroupsets
    if ( numReads>3 or len(context[c])>1 ):
      print "     --> ", c, context[c]

Here we define the position (0-based) of the BRAF V600 mutation:


In [37]:
chr = "7"
pos = 140453135
width = 11

OK, now we can loop over all of the samples we found earlier:


In [38]:
for aName in list3['CCLE_name']: 
  print " "
  print " "
  print aName
  r = bq.Query(get_readgroupsetid,c=aName).to_dataframe()
  for i in range(r.shape[0]):
    print "  ", r['Pipeline'][i], r['GG_readgroupset_id'][i]
  rgsList = r['GG_readgroupset_id'].tolist()
  getReads ( rgsList, pos, width)


 
 
COLO783_SKIN
   broad.mit.edu__DNASeq CJKPhaq1GhC4rZSVj4TMoIIB
   broad.mit.edu__RNASeq CJKPhaq1GhDN4avdoaTXsKcB
     -->  ATTTCACTGTA {u'CJKPhaq1GhC4rZSVj4TMoIIB': 47, u'CJKPhaq1GhDN4avdoaTXsKcB': 10}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhC4rZSVj4TMoIIB': 100, u'CJKPhaq1GhDN4avdoaTXsKcB': 30}
 
 
K029AX_SKIN
   broad.mit.edu__DNASeq CJKPhaq1GhCj-6WMobnvxCA
   broad.mit.edu__RNASeq CJKPhaq1GhDHzLyn-_j7w18
     -->  ATTTCACTGTA {u'CJKPhaq1GhDHzLyn-_j7w18': 8, u'CJKPhaq1GhCj-6WMobnvxCA': 63}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhDHzLyn-_j7w18': 13, u'CJKPhaq1GhCj-6WMobnvxCA': 119}
 
 
MALME3M_SKIN
   broad.mit.edu__RNASeq CJKPhaq1GhCMvqyM-MWor2o
   broad.mit.edu__DNASeq CJKPhaq1GhCj8MDayraUjOUB
     -->  ATTTCACTGTA {u'CJKPhaq1GhCj8MDayraUjOUB': 48, u'CJKPhaq1GhCMvqyM-MWor2o': 9}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhCj8MDayraUjOUB': 126, u'CJKPhaq1GhCMvqyM-MWor2o': 37}
 
 
SIGM5_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
   broad.mit.edu__DNASeq CJKPhaq1GhDH0sLg9qTgt5QB
   broad.mit.edu__RNASeq CJKPhaq1GhD_2PTBu4jiq5wB
     -->  ATTTCACTGTA {u'CJKPhaq1GhD_2PTBu4jiq5wB': 3, u'CJKPhaq1GhDH0sLg9qTgt5QB': 28}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhD_2PTBu4jiq5wB': 6, u'CJKPhaq1GhDH0sLg9qTgt5QB': 44}
 
 
WM793_SKIN
   broad.mit.edu__DNASeq CJKPhaq1GhDauaiO54TV-lU
   broad.mit.edu__RNASeq CJKPhaq1GhDo3-DQ8s2qiAU
     -->  ATTTCACTGTA {u'CJKPhaq1GhDo3-DQ8s2qiAU': 11, u'CJKPhaq1GhDauaiO54TV-lU': 50}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhDo3-DQ8s2qiAU': 13, u'CJKPhaq1GhDauaiO54TV-lU': 102}
 
 
WM88_SKIN
   broad.mit.edu__RNASeq CJKPhaq1GhDvot6FqNrO8bUB
   broad.mit.edu__DNASeq CJKPhaq1GhD85Zrppre74-0B
     -->  ATTTCACTGTA {u'CJKPhaq1GhDvot6FqNrO8bUB': 13, u'CJKPhaq1GhD85Zrppre74-0B': 47}
     -->  ATTTCTCTGTA {u'CJKPhaq1GhDvot6FqNrO8bUB': 17, u'CJKPhaq1GhD85Zrppre74-0B': 76}
 
 
WM1799_SKIN
   broad.mit.edu__RNASeq CJKPhaq1GhDrrMCJ0_zBuKkB
   broad.mit.edu__DNASeq CJKPhaq1GhDQ5u66urbHjGI
     -->  ATTTCTCTGTA {u'CJKPhaq1GhDrrMCJ0_zBuKkB': 42, u'CJKPhaq1GhDQ5u66urbHjGI': 143}

Notice that we consistently see greater read-depth in the DNA-seq data. Also all but the last sample are heterozygous for the V600 mutation, while WM1799_SKIN is homozygous. (Of course a proper analysis would also take into consideration the cigar information that is available with each read as well.)


In [ ]: